The coding sequence in a DNA molecule is treated as a message that must be transferred to a receiver, the proteins, through a noisy information channel, and each nucleotide is assigned a corresponding information weight. Using the nucleotide substitution matrix, we estimated the lower bound on the amount of information carried by nucleotides that is not subject to mutations. We used the calculated weights to reconstruct k-oligomers of genes from the Borrelia burgdorferi genome. We showed that a single simple rule suffices for this purpose: the number of bits of carried information cannot exceed some threshold value. The method is general and applies to every genome.

PACS numbers: 87.10.+e, 87.14.Gg 



I. INTRODUCTION 

Since the famous paper by Crick et al. [?] it has been generally accepted that there exists, in all living organisms, a code that makes possible the transfer of information from the sequences of four nucleotides in DNA to the sequences of 20 amino acids in proteins. Namely, the protein sequence is coded by codons, triplets of nucleotides, each corresponding to one amino acid in the protein sequence. There are 64 possible triplets and only 20 amino acids; hence, the genetic code is degenerate. Crick et al. [?] showed that codons are always read from the start, codon after codon, that they do not overlap, and that there are no commas between them. Therefore, each strand of the coding region of a DNA molecule could be read in three different reading frames, and all statistical analyses of the coding properties of DNA sequences should be done in a specific reference system consistent with the triplet nature of the genetic code [?]. Any analysis based only on a single sequence of nucleotides averages the coding information of the sequence, so the coding structure cannot be observed clearly. Crick formulated the Central Dogma of molecular biology, stating that genetic information first has to be transferred from DNA to RNA and then from RNA to protein. A wide discussion of the Central Dogma can be found in papers by Yockey [?]. The genetic code is universal in the sense that it is used by all living organisms, and no counterexample has been found so far. Thus, the same simple rule should govern the use of the genetic code, no matter how complex the organism is.

The coding sequence in DNA molecules can be thought of as a message that must be transferred from a source to a receiver through a noisy information channel (e.g., [?]). Hence, the four-letter alphabet (A, T, G, C) of DNA has to be translated into the 20-letter alphabet of amino acids in proteins. Shannon [?] considered the generation of a message to be a Markov process, subject to noise, and he introduced an expression for the measure of information in a message,

$$ H = -k \sum_i p_i \log p_i, \qquad (1) $$

known as the information entropy, where $i$ runs over the letters of the alphabet under consideration, $p_i$ is the probability of the symbol $i$, and the coefficient $k$ fixes the unit of measure. In the case of a binary alphabet, one usually chooses the logarithm to base 2 in the above expression. Then the amount of information in a message is measured by the average number of bits necessary to code all possible messages in an optimal way. For example, the field SEX in a database needs only one bit to code for male and female, e.g., 0 = female and 1 = male. The question arises: how many bits are necessary to measure the amount of information carried by A, T, G, C in DNA coding sequences? If there is no additional information about the frequency of occurrence of the nucleotides, then two bits are necessary for each nucleotide. Accordingly, the codons coding for amino acids need six bits of information each, whereas the amino acids need on average only about four bits. Additional information about the nucleotides, e.g., the compositional bias of the leading and lagging DNA strands, makes it possible to optimize the code in such a way that the most frequent nucleotides are represented by a smaller number of bits than the less frequent ones. However, there is an essential problem of the stability of such a code with respect to noise (e.g., during the translation process). This would be the case if the one-bit code representing the field SEX in a database were accidentally replaced by its counterpart while the database was being read. Thus, the amount of information representing the code under consideration cannot be too small, and it should depend on the level of noise.
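The bit counts quoted above follow from Eq. (1) with k = 1 and base-2 logarithms when all symbols are equally likely. The following is a minimal sketch under that uniform-probability assumption; the function and variable names are illustrative and do not come from the paper.

```python
import math

def shannon_entropy_bits(probabilities):
    """Return H = -sum(p * log2(p)) in bits (k = 1, logarithm base 2)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Uniform alphabets: 4 nucleotides, 64 codons, 20 amino acids.
bits_per_nucleotide = shannon_entropy_bits([1 / 4] * 4)
bits_per_codon = shannon_entropy_bits([1 / 64] * 64)
bits_per_amino_acid = shannon_entropy_bits([1 / 20] * 20)

print(bits_per_nucleotide, bits_per_codon, round(bits_per_amino_acid, 2))
```

For 20 equally likely amino acids the result is log2(20), slightly more than four bits, which is the sense in which the amino acids need "about four bits".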
The sense of Shannon's channel capacity theorem is that if one wants to send a message from source to receiver with as few errors as possible, then the code of the message should possess some redundancy. The problem is how to find suitable units for measuring the genetic code. Yockey showed that a codon cannot transfer 1.7915 bits of its six-bit alphabet to the protein sequence even if there are no errors (noise). This is almost 30% of the total information carried by codons. We showed that in the case of the B. burgdorferi genome the lower bound on the fraction of nucleotides that is not subject to mutations could amount to about 36% of the nucleotide occurrence in the genes. Thus, the number of bits that cannot be transferred could be much larger.

The need to assign suitable weights to nucleotides, in such a way that they are consistent with the amino acid structure, is also noted in papers dealing with DNA symmetry in terms of group theory [?]. In the following, we suggest an example of information weights for nucleotides that is sufficient for statistical analyses on the scale of a whole genome. In particular, we show that a single requirement on the capacity of the information channel, namely that the number of bits transferred through the channel in a message (a 3k-oligomer) cannot exceed a suitable threshold value, is sufficient to reproduce the nucleotide composition in each codon of the 3k-oligomers.

The problem discussed here is closely related to the design of DNA codes, which is important for biotechnology applications, e.g., storing and retrieving information in synthetic DNA strands or using them as molecular bar codes in chemical libraries. This has been discussed recently by Marathe et al. [?] (see also the very rich literature cited therein).



II. INFORMATION WEIGHTS OF NUCLEOTIDES

In papers [?], we concluded that in a natural genome the frequency of occurrence $f_j$ of the nucleotides ($j$ = A, T, G, C) in the third position in codons is linearly related to the respective mean survival time $T_j$,

$$ f_j = m_0 T_j + c_0, \qquad (2) $$

with the same coefficients, $m_0$ and $c_0$, for each nucleotide. The coefficient $m_0$ is proportional to the mutation rate $u$ experienced by the genome under consideration. This means that in a natural genome, with balanced mutation pressure and selection pressure, the nucleotide occurrences are highly correlated. This observation does not contradict Kimura's neutral theory of evolution [18], which assumes constancy of the evolution rate, where the mutations are random events, much the same as the decay events in radioactive decay. Actually, the mutations are random, but they are correlated with the DNA composition. Thus, the frequency $f_j$ contains information specific to the genome, and therefore it seems to be a natural candidate for modeling the information weight of nucleotides. In this case, the entropy (Eq. (1)) of a message consisting of the symbols A, T, G, C reads

$$ H = \sum_{j=A,T,G,C} f_j \log_2\left(\frac{1}{f_j}\right) = \left\langle \log_2\left(\frac{1}{f_j}\right) \right\rangle, \qquad (3) $$

where $\log_2(1/f_j)$ represents the number of bits necessary to code nucleotide $j$ in an optimal way, and the brackets in the last expression denote the expectation value of this number.

The question arises whether we could estimate the fraction $f'_j$ of the frequency $f_j$ that is not influenced by nucleotide substitutions. To answer this question, we considered the probability that nucleotide $j$ remains unmutated, which is equal to

$$ P_j = 1 - u W_j, \qquad (4) $$

where $W_j$ is the relative mutation probability,

$$ W_j = \sum_{i \neq j} W_{ij}, \qquad (5) $$

being the sum of the relative probabilities that nucleotide $j$ will mutate to nucleotide $i$, and $W_A + W_T + W_G + W_C = 1$. The parameter $u$ represents the mutation rate.
In the case of the B. burgdorferi genome, we have found [?] an empirical mutation table, applying to genes of the leading DNA strand, where the values of $W_{ij}$ are the following:

$$ \begin{array}{lll}
W_{GA} = 0.0667 & W_{GT} = 0.0347 & W_{GC} = 0.0470 \\
W_{AG} = 0.1637 & W_{AT} = 0.0655 & W_{AC} = 0.0702 \\
W_{TG} = 0.1157 & W_{TA} = 0.1027 & W_{TC} = 0.2613 \\
W_{CG} = 0.0147 & W_{CA} = 0.0228 & W_{CT} = 0.0350
\end{array} \qquad (6) $$



The probability $W_j$ is related to the mean survival time $T_j$ in Eq. (2) as follows (the derivation can be found in [15]):

$$ T_j = -\frac{1}{\ln(1 - u W_j)} \approx \frac{1}{u W_j}. \qquad (7) $$
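The two forms in Eq. (7) can be compared numerically. This is only a sketch: it reads Eq. (7) as T = -1/ln(1 - uW) with the small-rate approximation 1/(uW), and the value of W used below is a hypothetical input chosen for illustration.

```python
import math

def mean_survival_time(u, w):
    """Eq. (7): T = -1 / ln(1 - u*w), valid for 0 < u*w < 1."""
    return -1.0 / math.log(1.0 - u * w)

def approx_survival_time(u, w):
    """Small-rate approximation of Eq. (7): T ~ 1 / (u*w)."""
    return 1.0 / (u * w)

w_example = 0.1352      # hypothetical relative mutation probability
for u in (0.001, 0.1, 1.0):
    exact = mean_survival_time(u, w_example)
    approx = approx_survival_time(u, w_example)
    print(u, round(exact, 3), round(approx, 3))
```

The approximation is close when the product uW is small and overestimates the survival time as u approaches 1.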



In the extreme case of $u = 1$, we obtain the lower bound for the probability $P_j$ (Eq. (4)) that nucleotide $j$ remains unchanged by mutation at some instant of time $t$. We used these values to construct information weights for nucleotides without the contribution of substitutions. To this end, we normalized the probabilities $P_j$, and each nucleotide has been assigned the value

$$ B_j = \log_2\left(\frac{1}{P_j}\right), \qquad (8) $$

being the average number of bits necessary to code this nucleotide in an optimal way. We obtained the following values for the information capacity of the particular nucleotides:

$$ B_A = 1.79457, \quad B_T = 1.89278, \quad B_G = 2.27122, \quad B_C = 2.08743. \qquad (9) $$
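The chain from the mutation table to the weights can be sketched as follows. The code assumes that the second index of W_ij labels the nucleotide that mutates, as stated for Eq. (5), and that "normalized" means dividing each P_j by their sum; both readings are assumptions. Under them, the four computed numbers agree with the four values of Eq. (9) to about 1e-4, but they come out attached to the complementary nucleotides (A and T exchanged, G and C exchanged). The sketch does not settle which labelling is intended.

```python
import math

# Mutation table of Eq. (6); key "XY" holds W_XY as printed.
W = {
    "GA": 0.0667, "GT": 0.0347, "GC": 0.0470,
    "AG": 0.1637, "AT": 0.0655, "AC": 0.0702,
    "TG": 0.1157, "TA": 0.1027, "TC": 0.2613,
    "CG": 0.0147, "CA": 0.0228, "CT": 0.0350,
}

def information_weights(table, u=1.0):
    """Eqs. (4), (5), (8): sum W_ij over i, normalize P_j, take log2(1/P_j)."""
    # Assumption: the second index j is the nucleotide that mutates.
    w = {j: sum(v for key, v in table.items() if key[1] == j) for j in "ATGC"}
    p = {j: 1.0 - u * w[j] for j in "ATGC"}              # Eq. (4)
    total = sum(p.values())                              # normalization constant
    return {j: math.log2(total / p[j]) for j in "ATGC"}  # Eq. (8)

weights = information_weights(W)
for nucleotide, bits in weights.items():
    print(nucleotide, round(bits, 4))
```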
Fig. 1 shows the fraction $f'_j$ of the frequency $f_j$ that corresponds to non-mutated nucleotides when $u = 1$. This fraction has been obtained with the help of the normalized probabilities $P_j$,

$$ f'_j = P_j f_j. \qquad (10) $$

In the other extreme case, when $u = 0$ (no substitutions), the corresponding line in Fig. 1 would have a slope angle equal to $\pi/4$. Hence, the lower bound on the fraction of nucleotides that is not subject to mutations represents about 36% of the nucleotide occurrence in the genes. The linear relation observed in Fig. 1 between the fraction $f_j$ of a nucleotide in coding sequences and the fraction $f'_j$ of non-mutated nucleotides within it is implied by the linear law in Eq. (2).

The numbers in Eq. (9) can be compared with the corresponding values originating from the nucleotide occurrence $f_j$ in position (3) in codons of genes. They are the following:

$$ B_A = 1.7123, \quad B_T = 1.0356, \quad B_G = 2.8439, \quad B_C = 3.8841. \qquad (11) $$

Notice that both series of information weights share the same order of appearance.
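Because the weights in Eq. (11) are defined as log2(1/f_j), they can be inverted to recover the third-position frequencies and the entropy of Eq. (3). A minimal sketch, with illustrative variable names:

```python
import math

# Information weights of Eq. (11), in bits.
B_FREQ = {"A": 1.7123, "T": 1.0356, "G": 2.8439, "C": 3.8841}

# Invert B_j = log2(1 / f_j) to recover the frequencies f_j.
frequencies = {j: 2.0 ** (-b) for j, b in B_FREQ.items()}

# Entropy of Eq. (3): the expectation value of log2(1 / f_j).
entropy_bits = sum(f * math.log2(1.0 / f) for f in frequencies.values())

print({j: round(f, 4) for j, f in frequencies.items()})
print(round(sum(frequencies.values()), 4), round(entropy_bits, 4))
```

The recovered frequencies sum to one within rounding, and the entropy stays below the two bits per nucleotide of a uniform composition.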



TABLE I: Average number of bits representing the 20 amino acids, calculated with the help of the PAM1 substitution table published by Jones et al. [?].

FIG. 1: Relation between the fraction of non-mutated nucleotides and their frequency of occurrence in the case of the coding sequences of the B. burgdorferi genome. The lower bound for the fraction of non-mutated nucleotides is presented, for the mutation rate u = 1.



III. INFORMATION WEIGHTS OF AMINO ACIDS

Recently [19], we showed that the mean survival time $T_j$ of amino acids depends on their frequency of occurrence $F_j$ in proteins according to a power law,

$$ T_j \sim F_j^{\alpha}, \qquad (12) $$

where $j$ denotes the 20 amino acids. The exponent $\alpha$ has a negative value in the case of both selection pressure and mutation pressure, and a positive value in the case of pure mutation pressure. No selection data are available for a single genome, but there is a table of amino acid substitutions published by Jones et al. [?], which results from a statistical analysis of 16130 protein sequences from a few species. The table represents the so-called PAM1 matrix, corresponding to 1 percent of substitutions between two compared sequences. We found [?] that $\alpha \approx -1.3$ for this table. In the case of the B. burgdorferi genome, we succeeded in generating a table of substitutions representing a pure mutational pressure applied to genes from the leading DNA strand, and we obtained $\alpha \approx 0.3$ [19]. The exponent concerning pure mutational pressure is species specific.

TABLE II: The lower and upper bounds for the average number of bits representing 6-tuples of nucleotides in positions (1), (2), and (3) in codons of 18-oligomers cut from the genes of the B. burgdorferi genome, and the 6-tuples of the corresponding amino acids.

              amino acids   nucl.(1)   nucl.(2)   nucl.(3)
lower bound   20.0778       10.7674    10.7674    10.7674
upper bound   31.5625       13.4435    13.6273    13.0651

The power-law relation in Eq. (12) confirms that the occurrence of amino acids in protein sequences is highly correlated. Therefore, for each amino acid we calculated the probability that it remains unchanged, in the same way as in Eq. (4). Next, we calculated the respective information weights. They take the values presented in Table I.
FIG. 2: The effect of the choice of the average number of bits $B_j$ representing nucleotides on pattern recognition. The plot shows the dependence of the overlap parameter $q$ on the channel capacity, for computer-generated 6-oligonucleotides compared with the 6-oligonucleotides in position 3 in codons of the B. burgdorferi genome.



IV. MODEL OF INFORMATION FLOW THROUGH A CHANNEL

In our case, the counterpart of Shannon's message in a channel is a 3k-oligomer of nucleotides, and this sequence is translated into the corresponding k amino acids. We analyzed the cases k = 3, 4, ..., 10. The first step was to partition all the genes under consideration into non-overlapping 3k-oligomers. Next, each nucleotide was assigned the information weight $B_j$ (Eqs. (8) and (9)), as described above.
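The partitioning step can be sketched as below. The function names are hypothetical, the example sequence is made up, and dropping an incomplete trailing piece is an assumption, since the text does not say how gene ends are handled.

```python
def split_into_oligomers(gene, k):
    """Cut a coding sequence into non-overlapping oligomers of 3*k nucleotides.

    Assumption: an incomplete trailing piece is simply dropped.
    """
    size = 3 * k
    return [gene[i:i + size] for i in range(0, len(gene) - size + 1, size)]

def codon_position_subsequence(oligomer, position):
    """Return the k nucleotides found at codon position 1, 2 or 3."""
    return oligomer[position - 1::3]

gene = "ATGGCTAAAGGTCTTTAA"      # made-up 18-nucleotide example, not real data
oligomers = split_into_oligomers(gene, k=3)
print(oligomers)
print([codon_position_subsequence(oligomers[0], p) for p in (1, 2, 3)])
```

Each 3k-oligomer thus yields three k-tuples, one per codon position, which are the objects weighed in the rest of the section.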

We analyzed all possible 3k-oligomers, for each gene, and we found the lower bound and the upper bound of the information capacity

$$ \sum_{i=1}^{k} B_{j_i} $$

of the k nucleotides, separately in positions (1), (2), and (3) in codons, where $j_i$ denotes the nucleotide found in the given position of the $i$-th codon of the oligomer. It is known that the three subsequences of nucleotides, in the three positions in codons, are highly correlated and that their composition is strongly asymmetric (e.g., [?]). The triplets of nucleotides (codons) were translated into amino acids, and the lower and upper bounds were also found for amino acids. The results for k = 6 (18-oligomers) are presented in Table II, where the weights $\{B_j\}$ used are those defined in Eq. (9). Notice that the range of the differences among the numbers of bits representing oligonucleotides specific to each position in codons of the examined 18-oligomers is at most about 3 bits, whereas the corresponding range for the amino acids is about four times larger. We would like to underline that if we had measured the nucleotides with the values of $B_j$ introduced in Eq. (11), the ones originating from the nucleotide occurrences $f_j$, then the range of the corresponding differences would be almost four times larger. This is to be expected: an increase in the amount of carried information implies a larger flexibility of information packing, and it increases its stability with respect to substitutions. It is worth adding that, contrary to the information weights for nucleotides, there is almost no difference in the amount of information carried by amino acids between the case of the weights derived from the PAM1 matrix and the case of the weights derived from the frequency of occurrence of the particular amino acids. This result suggests that only a fraction of the coding sequence in a DNA molecule is important for the synthesis of proteins, and that this fraction could have very small packing flexibility. The tight packing of the nucleotide information would be consistent with our earlier observation that the three subsequences of nucleotides, representing nucleotides in positions 1, 2, and 3 of codons, have strongly asymmetric composition (e.g., [?]).

FIG. 3: Dependence of the overlap parameter on the channel capacity, for computer-generated k-oligonucleotides compared with the k-oligonucleotides in position 3 in codons of the B. burgdorferi genome, for k = 3, 4, 5, 6, 7, 8, 9, 10.
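The nucleotide entries of Table II can be checked against the weights of Eq. (9) if the information capacity of a k-tuple is taken to be the plain sum of the weights of its nucleotides, which is the reading assumed in this sketch. The function name and the example tuples are illustrative.

```python
# Information weights of Eq. (9), in bits.
B_MUT = {"A": 1.79457, "T": 1.89278, "G": 2.27122, "C": 2.08743}

def capacity(k_tuple, weights):
    """Total number of bits of a k-tuple: the sum of the weights B_j."""
    return sum(weights[n] for n in k_tuple)

k = 6
lowest = k * min(B_MUT.values())    # every nucleotide has the smallest weight
highest = k * max(B_MUT.values())   # every nucleotide has the largest weight

print(round(lowest, 4), round(highest, 4))
print(round(capacity("AAAAAA", B_MUT), 4), round(capacity("GGGGGC", B_MUT), 4))
```

Under this reading, the common lower bound 10.7674 in Table II equals six times the smallest weight, and the upper bound 13.6273 for position (2) equals six times the largest, so the 6-tuples can span at most about 2.9 bits.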

One of our main results is that, for suitably chosen information weights $B_j$ of nucleotides, a single requirement, that the number of bits in a 3k-oligomer cannot exceed some threshold value, is sufficient to reproduce the nucleotide composition in each codon of the 3k-oligomers. An indication of this property can be seen in Fig. 2, which plots an overlap parameter between two sets of classes: the 6-nucleotides used in the third position in codons of genes of B. burgdorferi, and those 6-oligonucleotides from among all 4096 possible 6-oligomers (there are $4^k$ k-tuples in the four-letter alphabet) that fulfill the condition that the total number of bits cannot exceed an assumed threshold value (the information channel capacity). The overlap parameter $q$ has been defined as follows:


where $x$ denotes the number of generated classes of sequences, k nucleotides long, that are exactly the same k-sequences as in the natural genome, $y$ denotes the number of generated classes of sequences that are different from those in the natural genome, and $z$ denotes the number of classes of sequences in the natural genome that have not been selected by the condition that the total number of transferred bits cannot exceed some assumed value. In Fig. 2 we can notice that the overlap parameter reaches its maximum value if the average number of bits representing nucleotides is suitably chosen. There is no recognition of the compared k-oligonucleotides in the case of the loss of information that occurs when all nucleotides are assigned the same weight $B_j = 1$. In the case when all nucleotides are assigned a weight equal to two bits, there is only a signal that below some threshold value of the assumed channel capacity there is no coincidence between natural and artificial sequences. There are two curves in the figure with the largest maximum value of $q$, corresponding to two different sets $\{B_j\}$ of information weights associated with nucleotides. The left curve represents the information capacity calculated with the help of the empirical substitution table of nucleotides [?] (Eq. (9)), whereas the right one is related to the frequency of occurrence of nucleotides in the genome. In general, one could find more representations $\{B_j\}$ of information weights leading to the same result (the same maximum value). The trivial ones are those representations which correspond to other values of the mutation rate $u$. We expect that the optimum representation should be the one that requires the smallest number of redundant bits. Hence, the left curve in Fig. 2 could represent the case with the minimal information weights for nucleotides necessary to be transferred to proteins. The results for k-oligonucleotides with other values of k can be found in Fig. 3, for the information weights chosen as in Eq. (9).
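The threshold selection and the counts x, y, z can be sketched as follows. The defining formula for q did not survive in this copy of the text, so the ratio q = x / (x + y + z) used below is an assumption made only to have a number to plot, not the authors' definition. The natural set is a toy example and the threshold is arbitrary.

```python
from itertools import product

B_MUT = {"A": 1.79457, "T": 1.89278, "G": 2.27122, "C": 2.08743}  # Eq. (9)

def classes_below_threshold(k, weights, threshold):
    """All k-tuples whose total number of bits does not exceed the threshold."""
    selected = set()
    for letters in product("ATGC", repeat=k):
        if sum(weights[n] for n in letters) <= threshold:
            selected.add("".join(letters))
    return selected

def overlap(generated, natural):
    """Return x, y, z as described in the text and an ASSUMED ratio q."""
    x = len(generated & natural)   # generated classes also present in nature
    y = len(generated - natural)   # generated classes absent from nature
    z = len(natural - generated)   # natural classes missed by the selection
    q = x / (x + y + z) if (x + y + z) else 0.0   # assumption, not from the source
    return x, y, z, q

natural_classes = {"AAAAAA", "AAAAAT", "GGGGGG"}   # toy set, not genome data
generated_classes = classes_below_threshold(6, B_MUT, threshold=10.9)
print(len(generated_classes), overlap(generated_classes, natural_classes))
```

Scanning the threshold over a range of values and recording q at each step would give a curve of the kind described for Fig. 2, under the stated assumption about q.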



V. DISCUSSION OF THE RESULTS 

The presence of lower and upper bounds for information packing, both in DNA sequences and in protein sequences, imposes natural boundaries in the model of Shannon's channel. Therefore, with the help of a computer random number generator, we generated as many 3k-oligomers as there are in the genes of the natural genome, with all the oligomers fulfilling the following three conditions:

- the frequency of occurrence of nucleotides was the same as in the coding sequences of the natural genome, separately in positions (1), (2), and (3) in codons,

- each nucleotide was assigned a value $B_j$ (Eqs. (9), (11)),

- the total number of bits of the selected k nucleotides in each position in codons of the 3k-oligomers could not fall outside the lower and upper bounds found for genes in the natural genome, and, likewise, the triplets of nucleotides from the 3k-oligomers, after translation of the considered sequence of nucleotides into a sequence of amino acids, could not fall outside the lower and upper bounds for amino acids.
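The first and third conditions can be illustrated for a single codon position with a simple rejection scheme. This is a sketch under stated assumptions, not the authors' generator: the frequencies are illustrative values, the bounds are the position-(3) entries of Table II, and the amino-acid bound is not checked here.

```python
import random

B_MUT = {"A": 1.79457, "T": 1.89278, "G": 2.27122, "C": 2.08743}  # Eq. (9)

def generate_k_tuple(k, frequencies, weights, lower, upper, rng, max_tries=10000):
    """Draw k nucleotides from the given frequencies; reject draws whose
    total number of bits falls outside [lower, upper]."""
    letters = list(frequencies)
    probs = [frequencies[n] for n in letters]
    for _ in range(max_tries):
        candidate = rng.choices(letters, weights=probs, k=k)
        bits = sum(weights[n] for n in candidate)
        if lower <= bits <= upper:
            return "".join(candidate)
    raise RuntimeError("no candidate satisfied the bounds")

rng = random.Random(0)                                    # fixed seed for repeatability
freq_pos3 = {"A": 0.31, "T": 0.49, "G": 0.14, "C": 0.06}  # illustrative values only
sample = [generate_k_tuple(6, freq_pos3, B_MUT, 10.7674, 13.0651, rng)
          for _ in range(5)]
print(sample)
```

A full reconstruction would draw the three codon positions together and also reject oligomers whose translated amino-acid weights fall outside the amino-acid bounds.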

We found that the distributions of the computer-generated k-nucleotides in all nucleotide positions in codons coincide, up to the noise introduced by the over-represented and under-represented oligonucleotides, with the corresponding distributions in the natural genome. This can be seen in Figs. 4-6, where we placed all generated oligomers and the natural ones in a space [A,T,G,C] with the help of the IFS (Iterated Function System) transformation [20]. In the case of k = 6, the points of the space [A,T,G,C] represent all 4096 possible classes of k-tuples of nucleotides, and the hills represent the numbers of identical sequences in each class. The detailed construction of the [A,T,G,C] space can be found in our paper [25]. The representation is similar to the chaos game representation of DNA sequences in the form of fractal images, first developed by Jeffrey [21] and followed by others, e.g., [?]. The size of the hills is closely related to the mutation pressure and selection. The particular case of the statistical properties of short oligonucleotides has been discussed recently by Buldyrev et al. [24]. In particular, they showed that the number of dimeric tandem repeats in coding DNA sequences follows an exponential distribution, whereas in non-coding sequences it is more often described by a power law. Other analyses of k-oligomers, such as Zipf analysis, can be found elsewhere, e.g., [?], [27].
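The idea of placing each k-tuple at its own point of a square can be illustrated with the standard chaos game representation. This is a sketch of that general technique, not of the authors' IFS construction, whose details are given in their cited paper; the corner assignment below is an assumption.

```python
# Corner assignment follows a common chaos-game convention; it is an assumption here.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def chaos_game_point(sequence, start=(0.5, 0.5)):
    """Map a nucleotide sequence to a point in the unit square by moving
    halfway toward the corner of each successive nucleotide."""
    x, y = start
    for nucleotide in sequence:
        cx, cy = CORNERS[nucleotide]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
    return x, y

print(chaos_game_point("A"))       # one step toward the A corner
print(chaos_game_point("GATTAC"))
```

With k = 6 every one of the 4096 classes lands in its own cell of a 64 x 64 grid, so counting sequences per cell gives a landscape of "hills" of the kind described in the text.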

In Figs. 4-6, the best result with respect to the comparison of the natural sequences with the reconstructed oligomers was obtained for the third position in codons, whereas the worst one was obtained for the second position in codons. However, the reconstructions of the second and the first positions in codons also share many features with the natural genome. The choice of another set $\{B_j\}$, the one originating from the nucleotide occurrence $f_j$ (Eq. (11)), leads to very similar results. We show this in Fig. 7 for position (2) in codons. In the figure, all k-oligonucleotides have been assigned a rank with respect to their occurrence, and their number in each of the 4096 classes is plotted versus the rank. Three cases are plotted in the figure: the number of k-nucleotides in the natural genome (continuous line), the number of k-nucleotides in generated sequences where the information weights are defined by Eq. (9) (dots), and the same where the information weights are defined by Eq. (11) (crosses). As we can see, even in the case of the weakest reconstruction of DNA oligomers, for position (2) in codons, the number of representatives of each of the 4096 classes approximates well the corresponding number in the natural genome. A much better approximation was obtained for the first and the third positions in codons. It seems that the information weights $B_j$ suggested by us in Eq. (9), based on the mutation table for nucleotides, estimate the information carried by nucleotides better, because they correspond to a smaller size of the information channel. This could be as in the example, given by Yockey [?], of the bar code attached to packaged items in stores, which permits the cashier to record the price of the item. Namely, the amount of information estimated by us with the help of the mutation table for nucleotides represents the sense code, whereas the remaining part represents the redundant bits. Hence, the frequency of occurrence of nucleotides in the natural genome represents two types of information, being a compromise between selection and mutation pressure. It is worth adding that the analogous information redundancy in protein sequences is very small.

FIG. 6: The same as in Fig. 4 but for position 1 in codons.

FIG. 7: Number of representatives in the 4096 classes of 6-sequences, sorted with respect to their rank number, in the case of: 6-nucleotides originating from position (2) in codons of the B. burgdorferi genome (the hills on the left side of Fig. 5), 6-nucleotides generated by computer (the hills on the right side of Fig. 5), where the information weights are defined in Eq. (9), and 6-nucleotides generated by computer, where the information weights are defined in Eq. (11).
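The rank ordering used for Fig. 7 amounts to counting how many sequences fall into each class and sorting the counts. A minimal sketch with toy data (the sequences below are invented and much shorter than the 6-tuples of the paper):

```python
from collections import Counter

def rank_ordered_counts(k_tuples):
    """Count the occurrences of each class and sort the counts by rank
    (rank 1 = most frequent class)."""
    counts = Counter(k_tuples)
    return [count for _, count in counts.most_common()]

natural = ["AAT", "AAT", "AAT", "TTA", "TTA", "GCA"]      # toy data
generated = ["AAT", "AAT", "TTA", "TTA", "TTA", "CCA"]    # toy data
print(rank_ordered_counts(natural))
print(rank_ordered_counts(generated))
```

Two sets can have identical rank curves while differing in which classes occupy each rank, so a rank plot compares the shapes of the distributions rather than the classes themselves.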

The reconstruction of DNA sequences that we obtained with the help of only one rule, that the number of transferred bits cannot exceed some threshold value, suggests that the width of the information channel is the basic mechanism of information packing in DNA coding sequences. This is consistent with the statement that if the genetic code is universal, i.e., common to all living organisms, then it must follow the same simple rules of coding.

Related research in the field of combinatorial DNA word design has been done by Marathe et al. [?], who discuss constraints imposed on the constructed code, such as the Hamming constraint, the free energy constraint, etc. Their paper is one of those dealing with biotechnological applications of DNA information. Our result, based on the use of tables of substitution rates for nucleotides and amino acids, could also offer a new possibility for designing the code of synthetic DNA for biotechnology purposes.



VI. CONCLUSIONS 

Our results suggest that the genetic code imposes very tight packing of the nucleotide information in DNA sequences. There is a fraction of the frequency of occurrence of nucleotides in coding sequences that represents the minimum inherited information, whereas the remaining part concerns redundant information, ensuring that the DNA code is stable against mutations. This observation does not contradict Kimura's neutral theory of evolution [18]. The mutations are random events, but they are correlated with the DNA composition.

It is possible that our method of constructing suitable information weights of nucleotides could be used to discriminate genes in DNA sequences.

It is also possible to use our method in some biotechnological applications dealing with the design of synthetic DNA codes.